The main goal of this study is to develop a model to predict the sale price of a house. A well performing model should provide an acceptable level of predictive accuracy while using only a limited number of features with the aim of reducing complexity and overfitting. A model of reduced size seeks to acheive a useful balance between the prediction and explanation goals.
For this study, we will also perform some exploratory data analysis to explain how the effect of a continous predictor on the response varies across different treatment groups of a categorical predictor. Namely, we’ll test:
\(H_0: \beta_{j,LivArea:Neighborhood} = 0, \forall j \text{ - The mean price per square foot of a house is the same for every neighborhood j }\)
\(H_1: \beta_{j,LivArea:Neighborhood} \neq 0, \forall j \text{ - At least one neighborhood j has a different mean price per square foot than others }\),
using a confidence level of \(\alpha= 0.01\).
We will also attempt to infer which neighborhoods place the highest monetary values on living area and determine whether the living area predictor is collinear with other predictors.
To develop the model, we will use a dataset describing houses sold between 2006 and 2010 in Ames, Iowa. This data set was compiled by Dean DeCock in 2011, and it provides over 2800 observations with 81 features (23 nominal/categorical, 23 ordinal, 14 discrete, and 21 continuous).
The features can be roughly divided into 6 categories.
We will begin by loading the housing data stored in AmesHousing.csv.
library(readr)
housing_data = read_csv("data/AmesHousing.csv")
There are two variables that do not describe house qualities. PID is the unique identifier of each observation and Order is a counter for each observation.
### Remove non-relevant variables
housing_data = subset(housing_data, select = -Order)
housing_data = subset(housing_data, select = -PID)
Next, we’ll inspect the strucutre of the dataset.
dim(housing_data) ## Dimension of the dataset.
## [1] 2930 80
table(sapply(housing_data, class)) ## Variable types
##
## character integer
## 44 36
We can observe that the dataset is composed by 2930 observations and 80 variables. There are only character and integer variables. A description of each variable can be found at: https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt.
We’ll examine whether there are NA values in the dataset and decide how to manage them.
hasNA = colSums(is.na(housing_data))[colSums(is.na(housing_data)) > 0]
hasNA
## Lot Frontage Alley Mas Vnr Type Mas Vnr Area Bsmt Qual
## 490 2732 23 23 80
## Bsmt Cond Bsmt Exposure BsmtFin Type 1 BsmtFin SF 1 BsmtFin Type 2
## 80 83 80 1 81
## BsmtFin SF 2 Bsmt Unf SF Total Bsmt SF Electrical Bsmt Full Bath
## 1 1 1 1 2
## Bsmt Half Bath Fireplace Qu Garage Type Garage Yr Blt Garage Finish
## 2 1422 157 159 159
## Garage Cars Garage Area Garage Qual Garage Cond Pool QC
## 1 1 159 159 2917
## Fence Misc Feature
## 2358 2824
27 variables contain NA values. Some of those variables have a high number NAs, therefore we decided to remove any variable with more than 400 NAs:
# Keep only those variables that has less than 400 NAs
housing_data = subset(housing_data, select = colSums(is.na(housing_data)) < 400)
After filtering, we now have 2930 observations and 74 variables.
In order to avoid any issues with the remaining observations that have NAs we are removing those as well:
housing_data = na.omit(housing_data)
After filtering, we now have 2678 observations and 74 variables.
TODO: Is this fair? Should we come up with a another justification other than citing the author? Not sure…
There are 5 unusual observations that were pointed out by the author of the dataset. These are easily observable by plotting SalePrice (price of the house) vs Gr Liv Area (above ground living area square feet) as follows:
plot(housing_data$SalePrice ~ housing_data$`Gr Liv Area`,
col = ifelse(housing_data$`Gr Liv Area` > 4000, "orange", "dodgerblue"),
xlab = "Ground Living Area (square feet)",
ylab = "Sale Price (US Dollars)",
main = "Sale Price vs Ground Living Area",
pch = 20,
cex = ifelse(housing_data$`Gr Liv Area` > 4000, 2, 0.7)
)
We observe that 3 points that have an extremely low price for a huge ground living area. The other two points seem to be priced correctly but are heavy outliers.
## Remove observations with living area larger than 4000 sq ft.
housing_data = housing_data[housing_data$`Gr Liv Area` < 4000, ]
We also note that two neighborhoods have less than 3 observations. A sample size smaller than 3 for a categorical group is not sufficiently large to derive meaningful estimates.
table(housing_data$Neighborhood)
##
## Blmngtn Blueste BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert
## 28 10 29 93 42 262 101 139 159
## Greens GrnHill IDOTRR Landmrk MeadowV Mitchel NAmes NoRidge NPkVill
## 8 1 69 1 25 101 416 69 23
## NridgHt NWAmes OldTown Sawyer SawyerW Somerst StoneBr SWISU Timber
## 163 131 203 140 107 170 51 38 70
## Veenker
## 24
## Remove observations belonging neighborhoods smaller than 3
housing_data = housing_data[-which(housing_data$Neighborhood %in% c("GrnHill","Landmrk")),]
Lastly, it is necesary to coerce all character columns as factors to fit the categorical predictors of the model:
# Determine variables that are of type character
char_var = lapply(housing_data, class) == "character"
# Coerce those columns to factor
housing_data[, char_var] = lapply(housing_data[, char_var], as.factor)
As stated in the Introduction the goal of the project is to obtain a model for predicting sales price, but trying to keep it simple. In other words, finding a good balance between prediction and explanation.
Before playing around with model selection, let’s divide the data into a training and a test set. That will help us evaluate the final model chosen. We’ll use the seed 420 for any random process so we can reproduce the results if needed.
set.seed(420)
# Get training indexes randomly (80% of the observations)
train_index <- sample(seq_len(nrow(housing_data)), size = floor(0.8 * nrow(housing_data)))
# Divide the data
train_hd <- housing_data[train_index, ]
test_hd <- housing_data[-train_index, ]
After splitting the data we now have 2136 observations for training and 535 observations for test.
In order to assess our models we are going to use the set of auxiliary functions defined on week’s 9 assignment. We are also defining a macro function that calls all those auxiliary functions and show the results in a single call. This overview encompasses the p-value of the normality and constant variance assumptions through Shapiro and Breusch test, the Fitted vs Residuals plot, the normal Q-Q plot, the number of parameters used (betas), the LOOCV RMSE, and the adjusted \(R^2\).
# Library required for Breusch-Pagan test
library(lmtest)
# Plot Fitted vs Residuals
plot_fitted_resid = function(model) {
plot(fitted(model), resid(model),
col = "dodgerblue", pch = 20, cex = .7,
xlab = "Fitted", ylab = "Residuals",
main = "Fitted vs residuals plot")
abline(h = 0, col = "darkorange", lwd = 1)
}
# Plot Normal Q-Q
plot_qq = function(model) {
qqnorm(resid(model), col = "dodgerblue", pch = 20, cex = .7)
qqline(resid(model), col = "darkorange", lwd = 1)
}
# Breusch-Pagan test (constant variance)
get_bp_decision = function(model) {
bptest(model)$p.value
}
# Shapiro-Wilk test (normality)
get_sw_decision = function(model) {
shapiro.test(resid(model))$p.value
}
# Number of parameters
get_num_params = function(model) {
length(coef(model))
}
# LOOCV RMSE
get_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
# Adjusted R^2
get_adj_r2 = function(model) {
summary(model)$adj.r.squared
}
# Function that combines the previously defined functions
get_overview = function(model) {
par(mfrow=c(1,2)) #Plot two graphs in the same image
plot_fitted_resid(model)
plot_qq(model)
loocv_rmse = get_loocv_rmse(model)
adj_r2 = get_adj_r2(model)
bp_p_value = get_bp_decision(model)
shapiro_p_value = get_sw_decision(model)
betas = get_num_params(model)
list(loocv_rmse = loocv_rmse, adj_r2 = adj_r2, bp_p_value = bp_p_value, shapiro_p_value = shapiro_p_value, betas = betas)
}
We can start the process of finding our model by using an additive model as a baseline:
model_add = lm(SalePrice ~ ., data = train_hd)
Let’s run the overview function to assess this baseline model:
get_overview(model_add)
## $loocv_rmse
## [1] Inf
##
## $adj_r2
## [1] 0.9334242
##
## $bp_p_value
## BP
## 1.341791e-22
##
## $shapiro_p_value
## [1] 1.678226e-32
##
## $betas
## [1] 249
From the results above you can note that there are troubles on every resulting parameter, except the adjusted \(R^2\). LOOCV RMSE is inf (because there are hat values equal to 1), both assumptions are violated (p-values are very low), and the number of parameters used is huge (because of the categorical variables, we end up having more betas than the number of predictors available). The Fitted vs Residuals plot also seems to violate both, the 0 mean at any fitted value and the constant variance at any fitted value (no linearity and no constant variance).
Based on the normal Q-Q plot is evident that there is a huge deviation at the edges of the plot. We can apply a log transformation on the response and check whether this minimizes the effect:
model_add_log = lm(log(SalePrice) ~ ., data = train_hd)
get_overview(model_add_log)
## $loocv_rmse
## [1] Inf
##
## $adj_r2
## [1] 0.9362686
##
## $bp_p_value
## BP
## 4.827006e-17
##
## $shapiro_p_value
## [1] 1.375178e-35
##
## $betas
## [1] 249
The Shapiro’s p-value is worse, but the normal Q-Q plot looks better. This can be due to some outliers that can be observed on the plot.
The outliers can be identified by using plot function:
par(mfrow=c(2,2))
plot(model_add_log)
From the resulting plots we can see that observations 1682, 578, and 213 seem to be outliers. Let’s remove them:
train_hd = train_hd[-c(1682, 578, 213),]
model_add_log = lm(log(SalePrice) ~ ., data = train_hd)
get_overview(model_add_log)
## $loocv_rmse
## [1] Inf
##
## $adj_r2
## [1] 0.9449537
##
## $bp_p_value
## BP
## 7.455871e-24
##
## $shapiro_p_value
## [1] 7.691894e-24
##
## $betas
## [1] 249
It now looks a bit better, but there is still a noticeable violation of normality at the edges. The Fitted vs Residuals plot also looks better, especially the 0 mean at any fitted value.
We tried to find outliers or influential points, but using the heuristics of the book is not possible to find any:
# Number of influential points in the transformed model
sum(cooks.distance(model_add_log) > 4 / length(cooks.distance(model_add_log)))
## [1] NA
# Number of outliers in the transformed model
sum(abs(rstandard(model_add_log)) > 2)
## [1] NA
We can also see the positive effect of the log function in the response by looking at its histogram. Without log:
hist(train_hd$SalePrice,
breaks = 30,
main = "Histogram of Sale Price without transformation",
xlab = "Sale Price",
col = "gray")
As you can see this plot is right skewed. This can be normalized by using log:
hist(log(train_hd$SalePrice),
breaks = 30,
main = "Histogram of Sale Price with log transformation",
xlab = "Sale Price",
col = "gray")
We can notice that the obtained plot is much better and the log transformation makes sense for the response.
Let’s use this transformed additive model as our baseline. There are still many paramaters that can be improved.
There are many predictors to play with. We can try to choose a smaller base of “relevant” numerical predictors and then assess the addition of categorical variables through a forward process.
One way of doing this is by selecting those numerical predictors that are highly correlated to the response.
# Get correlation list of all numerical variables against Sale Price
sale_price_cor = cor(train_hd[sapply(train_hd, is.numeric)])[,"SalePrice"]
sale_price_cor
## Lot Area Overall Qual Overall Cond Year Built
## 0.27204535 0.79426380 -0.14247853 0.54446953
## Year Remod/Add Mas Vnr Area BsmtFin SF 1 BsmtFin SF 2
## 0.53277472 0.49309078 0.42169310 -0.02516887
## Bsmt Unf SF Total Bsmt SF 1st Flr SF 2nd Flr SF
## 0.16654274 0.65473647 0.65230975 0.25015743
## Low Qual Fin SF Gr Liv Area Bsmt Full Bath Bsmt Half Bath
## -0.02144407 0.73065152 0.27390192 -0.07082941
## Full Bath Half Bath Bedroom AbvGr Kitchen AbvGr
## 0.55217520 0.25680213 0.14516273 -0.08280057
## TotRms AbvGrd Fireplaces Garage Yr Blt Garage Cars
## 0.52189041 0.45534222 0.52975789 0.65898016
## Garage Area Wood Deck SF Open Porch SF Enclosed Porch
## 0.65081976 0.30463025 0.32587883 -0.11539154
## 3Ssn Porch Screen Porch Pool Area Misc Val
## 0.02536506 0.10758806 0.04195898 -0.01490168
## Mo Sold Yr Sold SalePrice
## 0.05004228 -0.02129590 1.00000000
We can then remove those values that are low in magnitude. We’ll choose a threshold of 0.4:
# Filter correlations with magnitude smaller than a specified threshold
sale_price_cor = sale_price_cor[abs(sale_price_cor) > 0.4]
sale_price_cor
## Overall Qual Year Built Year Remod/Add Mas Vnr Area BsmtFin SF 1
## 0.7942638 0.5444695 0.5327747 0.4930908 0.4216931
## Total Bsmt SF 1st Flr SF Gr Liv Area Full Bath TotRms AbvGrd
## 0.6547365 0.6523098 0.7306515 0.5521752 0.5218904
## Fireplaces Garage Yr Blt Garage Cars Garage Area SalePrice
## 0.4553422 0.5297579 0.6589802 0.6508198 1.0000000
We now have a smaller group of numerical predictors to choose from. By reading the documentation of this set of parameters one can see that some of them are the result of the others or are very related. This obviously leads to collinearity issues.
Year built and year of remodelation are repetitive. If there was no remodelation the year built is used on Year Remod/Add. Let’s keep Year Remod/Add only.
BsmtFin SF 1 and Total Bsmt SF are also related. Let’s keep Total Bsmt SF which has a higher correlation with SalePrice. Same happens with 1st Flr SF and Gr Liv Area, we’ll keep Gr Liv Area.
There are also 3 numerical variables related to the garage. The Garage Cars seems to be the most relevant.
We can now simplify our sale_price_cor vector with the chosen variables:
# Auxiliary vector with names
var = names(sale_price_cor)
sale_price_cor = sale_price_cor[var != "Year Built" & var != "BsmtFin SF 1" & var != "1st Flr SF" & var != "Garage Yr Blt" & var != "Garage Area"]
sale_price_cor
## Overall Qual Year Remod/Add Mas Vnr Area Total Bsmt SF Gr Liv Area
## 0.7942638 0.5327747 0.4930908 0.6547365 0.7306515
## Full Bath TotRms AbvGrd Fireplaces Garage Cars SalePrice
## 0.5521752 0.5218904 0.4553422 0.6589802 1.0000000
Let’s fit a model with the chosen numerical variables and finish the collinearity study using vif analysis. Don’t forget that we are transforming the response using log.
# Fit a simple additive model with the chosen numerical variables
model_add_num = lm(log(SalePrice) ~ ., data = train_hd[, names(sale_price_cor)])
# Run vif test
library(faraway)
##
## Attaching package: 'faraway'
## The following object is masked from 'package:lattice':
##
## melanoma
vif(model_add_num)
## `Overall Qual` `Year Remod/Add` `Mas Vnr Area` `Total Bsmt SF`
## 2.472044 1.680024 1.308028 1.517613
## `Gr Liv Area` `Full Bath` `TotRms AbvGrd` Fireplaces
## 4.487689 2.098590 2.970419 1.339686
## `Garage Cars`
## 1.930246
There are no values over 5 which is the heuristic used in the course book that should cause concern. It seems that we removed the right numerical variables.
We can apply the get_overview function to evaluate assumptions as well:
get_overview(model_add_num)
## $loocv_rmse
## [1] 0.1416273
##
## $adj_r2
## [1] 0.8586069
##
## $bp_p_value
## BP
## 4.968603e-14
##
## $shapiro_p_value
## [1] 6.35861e-23
##
## $betas
## [1] 10
Everything looks similar to the additive model using all the predictors. This time the LOOCV RMSE is not inf which is also a good signal (no hat values of 1).
Let’s now check whether transforming any of these predictors makes sense. A visual inspection of pairs plot might suggest possible transformations.
pairs(train_hd[, names(sale_price_cor)])
Based on this output, it seems that trying a log transformation on Gr Liv Area and Total Bsmt SF can make sense to soften the effect ot extreme values:
model_add_num_log = lm(log(SalePrice) ~ `Overall Qual` + `Year Remod/Add` + `Mas Vnr Area` + log(`Total Bsmt SF`) + log(`Gr Liv Area`) + `Full Bath` + `TotRms AbvGrd` + `Fireplaces` + `Garage Cars`, data = train_hd)
get_overview(model_add_num_log)
## $loocv_rmse
## [1] 0.1442976
##
## $adj_r2
## [1] 0.8532499
##
## $bp_p_value
## BP
## 2.673067e-11
##
## $shapiro_p_value
## [1] 8.29974e-21
##
## $betas
## [1] 10
After the transformation the assumption parameters improved and the rest remain almost the same. We can keep these changes.
Let’s now try to simplify the model using a backward selection procedure and AIC.
model_add_num_log_back = step(model_add_num_log, direction = "backward", trace = 0)
model_add_num_log_back$coefficients
## (Intercept) `Overall Qual` `Year Remod/Add`
## 0.2040155504 0.0924130820 0.0036311220
## `Mas Vnr Area` log(`Total Bsmt SF`) log(`Gr Liv Area`)
## 0.0001084771 0.1973170774 0.3516903192
## `Full Bath` Fireplaces `Garage Cars`
## -0.0135658961 0.0640742108 0.0645105416
get_overview(model_add_num_log_back)
## $loocv_rmse
## [1] 0.1442363
##
## $adj_r2
## [1] 0.8532733
##
## $bp_p_value
## BP
## 6.120295e-12
##
## $shapiro_p_value
## [1] 9.024409e-21
##
## $betas
## [1] 9
The number of rooms above the ground were removed from the model. Let’s confirm that this is not a significant removal using an ANOVA test:
p_val = anova(model_add_num_log_back, model_add_num_log)$"Pr(>F)"[2]
p_val
## [1] 0.4162366
We failed to reject the null hypothesis for any rational significance level, which means that the simplified model can be used and the number of rooms above ground were not significant. We now have only 9 numerical variables.
Let’s see whether this model can be improved by adding categorical variables. This can be done using a forward selection procedure and BIC to get the smaller possible model:
# The scope includes all the filtered numerical variables plus all the available categorical variables
model_add_mix = step(model_add_num_log_back, direction = "forward", scope = SalePrice ~ `Overall Qual` + `Year Remod/Add` + `Mas Vnr Area` + log(`Total Bsmt SF`) + log(`Gr Liv Area`) + `Full Bath` + `Fireplaces` + `Garage Cars` + Street + `MS SubClass` + `MS Zoning` + `Lot Shape` + `Land Contour` + Utilities + `Lot Config` + `Land Slope` + Neighborhood + `Condition 1` + `Condition 2` + `Bldg Type` + `House Style` + `Roof Style` + `Roof Matl` + `Exterior 1st` + `Exterior 2nd` + `Mas Vnr Type` + `Exter Qual` + `Exter Cond` + Foundation + `Bsmt Qual` + `Bsmt Cond` + `Bsmt Exposure` + `BsmtFin Type 1` + `BsmtFin Type 2` + Heating + `Heating QC` + `Central Air` + Electrical + `Kitchen Qual` + Functional + `Garage Type` + `Garage Finish` + `Garage Qual` + `Garage Cond` + `Paved Drive` + `Sale Type` + `Sale Condition`, k = log(length(resid(model_add_num_log_back))), trace = 0)
We can now run our diagnostics for the obtained model:
get_overview(model_add_mix)
## $loocv_rmse
## [1] Inf
##
## $adj_r2
## [1] 0.9185498
##
## $bp_p_value
## BP
## 2.429563e-38
##
## $shapiro_p_value
## [1] 1.101009e-22
##
## $betas
## [1] 89
The assumptions plots remain the same. There is a noticeable improvement on adjusted \(R^2\) and our LOOCV RMSE is inf again. This is due to some hat values being 1. Let’s explore how many we have:
sum(hatvalues(model_add_mix) == 1)
## [1] 4
Since the number of conflicting hat_values is so low we can simply ignore them and calculate LOOCV RMSE manually:
loocv_rmse_final_mod = sqrt(mean((resid(model_add_mix)[-which(hatvalues(model_add_mix) == 1)] / (1 - hatvalues(model_add_mix)[-which(hatvalues(model_add_mix) == 1)])) ^ 2))
loocv_rmse_final_mod
## [1] 0.1114185
This is a much better value than our previous model, which is super good for prediction purposes. The model_add_mix is our final choice. We’ll assess it against the test set in the results section. We’ll also discuss the chosen model in general.
The chosen model to predict Sale Price for the given dataset was stored in model_add_mix. It uses 20 predictors out of 80 originally available, which is aligned to our initial commitment towards not only getting a good model for prediction, but also a balanced model between prediction and explanation. 12 of those predictors are categorical and 8 numerical.
To assess the final model, we can first focus on the prediction capabilities and then around explanatory capabilities.
From a prediction perspective we can first plot the fitted values versus the actual values to get a visual look of how the chosen model performs on the training set:
plot(train_hd$SalePrice, exp(fitted(model_add_mix)),
xlab = "Actual price (dollars)",
ylab = "Predicted price (dollars)",
main = "Predicted vs actual price in dollars for training set",
col = "orange")
abline(0, 1, lwd = 2, col = "dodgerblue")
ADD COMMENTS HERE (variance for high prices seem to increase. Model looks really good below 400 000 dollars)
Let’s do the same but this time for the test set. Let’s store the predicted values in a variable to be used on calculating more results.
# We are removing a conflicting observation for "Kitchen Qual" that is only present on the test set
# This is because there is only one observation with that value
test_hd = test_hd[-which(test_hd$`Kitchen Qual` == "Po"),]
# Predict values
predicted = exp(predict(model_add_mix, test_hd))
We can now plot for the test set. These are unbiased results that can tell us a lot about the chosen model:
plot(test_hd$SalePrice, predicted,
xlab = "Actual price (dollars)",
ylab = "Predicted price (dollars)",
main = "Predicted vs actual price in dollars for test set",
col = "orange")
abline(0, 1, lwd = 2, col = "dodgerblue")
We will begin by loading the housing data stored in AmesHousing.csv. We will also define a convenience function splitDataset that will randomly split the data into two data sets: “train” and “test”.
splitDataset = function(data) {
training = createDataPartition(data$SalePrice, times=1, p=0.80, list=FALSE)
return(list(train=data[training,], test=data[-training,]))
}
splits = splitDataset(housing_data)
training_data = splits$train
test_data = splits$test
Now we can begin our exploratory data analysis. To examine the relationship between Neighborhood and SalePrice, we will perform a simple one-way ANOVA to determine whether belonging to different neighborhoods has any effect on SalePrice at all.
neighborhood_oneway_anova = aov(SalePrice ~ Neighborhood, data=housing_data)
neigh_oneway_anova_summary = summary(neighborhood_oneway_anova)
A simple linear model would allow us to infer whether the Ground Living Area (Gr.Liv.Area) of a house has any relationship with SalePrice.
livarea_model = lm(SalePrice ~ `Gr Liv Area`, data=housing_data)
livarea_model_summary = summary(livarea_model)
And, in order to infer whether or not a house’s living area is more expensive in some neighborhoods than others, we’ll perform Analysis of Co-Variance (ANCOVA) by including an interaction term between the continous variable Gr.Liv.Area and the categorical variable Neighborhood. If the slopes of Gr.Liv.Area are different (not parallel) for different neighborhoods, then we would notice a strong interaction between the two predictors.
## Workaround
housing_data_alias = housing_data
housing_data_alias$Gr.Liv.Area = housing_data$`Gr Liv Area`
neigh_livarea_interaction_model = lm(SalePrice ~ Gr.Liv.Area*Neighborhood, data=housing_data_alias)
neigh_livarea_anova = anova(neigh_livarea_interaction_model)
p_value_neigh_livarea_interact =anova(neigh_livarea_interaction_model)[3,5]
We will also estimate actual slopes of the lines of SalePrice vs. Gr.Liv.Area for different neighborhoods to examine any differences.
livarea_slopes_raw = sim_slopes(neigh_livarea_interaction_model, pred=Gr.Liv.Area, modx=Neighborhood,
johnson_neyman = FALSE,centered= "none", confint=T)
livarea_slopes = data.frame(livarea_slopes_raw$slopes[,1:5],stringsAsFactors=FALSE)
livarea_slopes[,2:5] = sapply(livarea_slopes[,2:5], as.double)
livarea_slopes[,2:5] = sapply(livarea_slopes[,2:5], signif, digits=4)
colnames(livarea_slopes)[1] = "Neighborhood"
colnames(livarea_slopes)[3] = "Std. Err."
colnames(livarea_slopes)[4] = "2.5%"
colnames(livarea_slopes)[5] = "97.5%"
Lastly, in order to determine whether the effect of Gr.Liv.Area on SalePrice is not explained by other predictors, we’ll calculate the partial correlation coefficient between them with the effect of all other predictors removed.
From out first ANOVA of Neighborhood and SalePrice, it can easily be seen that the neighborhood had a significant effect on the price. We can also see this graphically using a box-plot.
neigh_oneway_anova_summary
## Df Sum Sq Mean Sq F value Pr(>F)
## Neighborhood 25 9.156e+12 3.662e+11 136.3 <2e-16 ***
## Residuals 2645 7.105e+12 2.686e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(SalePrice ~ Neighborhood, data = housing_data,las=2, col=2:8, pch=20, cex=.7,
main="Sale Prices by Neighborhood",
ylab="Sale Price (in US Dollars")
Similarly, it is easily shown that Gr.Liv.Area has an effect on SalePrice.
plot(SalePrice ~ `Gr Liv Area`, data = housing_data,
xlab = "Living Area (square feet)",
ylab = "Sale Price (US Dollars)",
main = "Sale Price vs Ground Living Area",
pch = 20,
cex = .5,
col = "grey")
abline(livarea_model, col="darkorange", lwd=2)
However, we also observe that the interaction model between Neighborhood and Gr.Liv.Area yielded a near-zero p-value of 9.87210^{-75}. This suggests a significant interaction between the two predictors when predicting SalePrice.
signif(p_value_neigh_livarea_interact,4)
## [1] 9.872e-75
print(neigh_livarea_anova[-4,], signif.stars=FALSE)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## Gr.Liv.Area 1 8.7037e+12 8.7037e+12 6754.782 < 2.2e-16
## Neighborhood 25 3.5875e+12 1.4350e+11 111.370 < 2.2e-16
## Gr.Liv.Area:Neighborhood 25 5.9450e+11 2.3780e+10 18.455 < 2.2e-16
Because of this, we obtain different slopes for Gr.Liv.Area depending on Neighborhood. Each of the slope estimates are the sum of the Gr.Liv.Area coefficient and its interaction term with the corresponding Neighborhood factor dummy variable. As with all estimates, we can calculate a mean and a confidence interval.
## Get the top 5 neighborhoods with the most houses
ordered_neigh = order(-table(housing_data$Neighborhood))
interact_plot(neigh_livarea_interaction_model, pred=Gr.Liv.Area,
modx=Neighborhood,color.class="Qual1",
modxvals = names(table(housing_data$Neighborhood)[ordered_neigh][1:6]),
x.label="Ground Living Area (square feet)",
y.label="Sale Price (US Dollars)") +
ggtitle("Gr.Liv.Area Slopes of the 5 Neighborhoods with Most Sales" ) +
theme(plot.title = element_text(hjust = 0.5))
Figure X: FOOOBAR
kable(livarea_slopes)
| Neighborhood | Est. | Std. Err. | 2.5% | 97.5% |
|---|---|---|---|---|
| Blmngtn | 130.50 | 45.260 | 41.7000 | 219.20 |
| Blueste | -14.07 | 71.290 | -153.9000 | 125.70 |
| BrDale | 58.43 | 46.600 | -32.9500 | 149.80 |
| BrkSide | 72.45 | 11.330 | 50.2400 | 94.67 |
| ClearCr | 56.54 | 14.830 | 27.4500 | 85.63 |
| CollgCr | 106.10 | 5.434 | 95.4200 | 116.70 |
| Crawfor | 94.77 | 7.872 | 79.3400 | 110.20 |
| Edwards | 89.47 | 8.154 | 73.4800 | 105.50 |
| Gilbert | 77.71 | 9.873 | 58.3500 | 97.07 |
| Greens | 100.00 | 81.290 | -59.3600 | 259.50 |
| IDOTRR | 55.44 | 10.810 | 34.2400 | 76.64 |
| MeadowV | 38.62 | 21.180 | -2.9100 | 80.15 |
| Mitchel | 67.29 | 9.513 | 48.6300 | 85.94 |
| NAmes | 54.61 | 4.540 | 45.7100 | 63.51 |
| NoRidge | 140.00 | 11.060 | 118.4000 | 161.70 |
| NPkVill | 25.16 | 30.690 | -35.0200 | 85.34 |
| NridgHt | 163.70 | 6.299 | 151.3000 | 176.00 |
| NWAmes | 67.46 | 7.250 | 53.2400 | 81.67 |
| OldTown | 60.69 | 4.769 | 51.3400 | 70.04 |
| Sawyer | 38.81 | 8.763 | 21.6300 | 55.99 |
| SawyerW | 84.56 | 7.220 | 70.4000 | 98.71 |
| Somerst | 123.10 | 8.449 | 106.5000 | 139.70 |
| StoneBr | 160.30 | 8.595 | 143.5000 | 177.20 |
| SWISU | 40.64 | 11.330 | 18.4200 | 62.86 |
| Timber | 105.70 | 12.060 | 82.0700 | 129.40 |
| Veenker | 27.89 | 14.470 | -0.4871 | 56.26 |
Table X: FOOBAR
Having observed how both Neighborhood and Gr.Liv.Area interact with each other and influence SalePrice, it is important to highlight that it does not imply a casual relationship between them. This becomes more evident when we examine the partial correlation coefficient of Gr.Liv.Area and SalePrice with the effect of all other 78 predictors removed. We observe that its magnitude is less than 0.01.
without_livarea = lm(SalePrice ~ . -`Gr Liv Area` , data=housing_data)
livarea_collinearity = lm(`Gr Liv Area`~ . -SalePrice, data=housing_data)
(partial_corr_livarea_saleprice = cor(resid(without_livarea),resid(livarea_collinearity)))
## [1] 0.003903368
Therefore, ground living area is collinear with many of the other predictors available in the dataset, and does not independently predict price. We can, however, reject the null hypothesis that the average price per square foot of living space is equal for all neighborhoods with a confidence level \(\alpha= 0.01\).